Abstract

This paper provides a formal analysis of a powerful mapping technique known as scatter decomposition.
Scatter decomposition divides an irregular computational domain into a large number of equal sized
pieces, and distributes them modularly among processors. We use a probabilistic model of workload in
one dimension to formally explain why, and when, scatter decomposition works. Our first result is that if
correlation in workload is a convex function of distance, then scattering a more finely decomposed domain 
yields a lower average processor workload variance. Our second result shows that if the workload process 
is stationary Gaussian and the correlation function decreases linearly in distance until becoming zero 
and then remains zero, scattering a more finely decomposed domain yields a lower expected maximum 
processor workload. Finally we show that if the correlation function decreases linearly across the entire 
domain, then among all mappings that assign an equal number of domain pieces to each processor, scat- 
ter decomposition minimizes the average processor workload variance. The dependence of these results 
on the assumption of decreasing correlation is illustrated with situations where a coarser granularity 
actually achieves better load balance. 


*This research was supported in part by NASA contracts NAS1-18107 and NAS1-18605, and NSF Grant ASC 8819373.
†Supported in part by NASA contracts NAS1-18107 and NAS1-18605, the Office of Naval Research under contract No.
N00014-86-K-0310, and NSF grant DCR 8106181.
1 Introduction


Scatter decomposition [1] (also described as modular mapping [4]) is an effective method for parallelizing
a large class of irregular scientific programs that are tied to physical domains. Examples include a wide 
variety of techniques for numerically solving time dependent partial differential equations, and other, less 
numerical domain-oriented simulations. Scatter decomposition divides the domain into a set of rectangular 
regions with the same spatial size and geometry. The regions are labeled using Cartesian coordinates, and 
are mapped to processors by applying the mod function to the label in each coordinate. For example, Figure 
1 shows how a two dimensional irregular grid for a PDE is decomposed into strips (marked by the heavy 
lines) and assigned to processors. The execution of all workload related to a subregion is a basic unit of 
schedulable work which we call a cluster. A cluster’s granularity is controlled by the parameters defining the
region size, in this case the strip width. 

Scatter decomposition’s success lies in its ability to balance workload without ever actually analyzing it. 
Any region of high workload tends to be subdivided and distributed among processors. Scatter decomposition 
is a technique applied to many problems in many contexts [1, 2, 4, 5, 9, 11, 14, 17]. Its success has been 
explained informally in [1] and [4], by appealing to the physics and numerics of many scientific computations. 
While these explanations suffice for most practitioners, the literature lacks a full formal analysis of why scatter 
decomposition balances workload. This paper provides some such analysis, identifying model assumptions 
under which scatter decomposition can be expected to effectively balance load. As such, our work is a 
necessary prerequisite for any future formal treatment of the very important problem of managing the 
inherent tensions between load imbalance and communication costs in a scatter decomposition. 

The object of this paper is to construct and analyze a performance model to explain when and why scatter 
decomposition works. The model is based on a number of simplifying assumptions to promote tractability. 
As such, it should not be viewed as a model that accurately predicts performance quantitatively. Rather,
it should be viewed as a model that explains performance qualitatively. Specifically, we model workload in 
a one dimensional domain as a continuous second-order stationary process. This means that we associate 
a random workload with every point in the domain, assume that the mean workload at every point is the 
same, assume that the workload variance at every point is the same, and assume that the covariance between 
the workloads at any two points is uniquely determined by their distance. The model takes the domain to 
be divided into some $n = 2^d$ clusters of equal size, mapped modularly onto $P = 2^p$ processors. Throughout
this paper we take $P$ to be fixed, and $d > p$. The degree of the decomposition is defined to be $d$. Given one
scatter decomposition, another of higher degree can be constructed by splitting each cluster into two, then 
by modularly mapping the resulting set of clusters. 
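
The construction can be sketched in a few lines. This is a minimal illustration, not code from the paper: the names `modular_mapping` and `owned_intervals` are hypothetical, and the degenerate case d = p (one cluster per processor) is allowed for convenience.

```python
def modular_mapping(d, p):
    """Return owner[i], the processor of cluster i, for n = 2**d clusters and P = 2**p processors."""
    if d < p:
        # The paper takes d > p; d == p is accepted here as the degenerate case.
        raise ValueError("need at least one cluster per processor (d >= p)")
    num_procs = 2 ** p
    return [i % num_procs for i in range(2 ** d)]


def owned_intervals(d, p, proc):
    """Subintervals of [0, 1] owned by processor `proc` at degree d."""
    n = 2 ** d
    return [(i / n, (i + 1) / n)
            for i, owner in enumerate(modular_mapping(d, p)) if owner == proc]


if __name__ == "__main__":
    # Raising the degree halves every cluster and remaps the pieces modularly.
    for degree in (2, 3):
        print(degree, owned_intervals(degree, p=1, proc=0))
```

Raising the degree from 2 to 3 with two processors turns processor 0's two quarter-length pieces into four eighth-length pieces spread across the whole domain.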

We derive three main results, each of which has a different set of assumptions concerning the correlation 
function. 
1. Assumption: The correlation function is convex. Result: Increasing the degree of a scatter decomposition
does not increase the common processor workload variance.

2. Assumptions: The workload process is stationary and Gaussian. The correlation function decreases
linearly until reaching zero, then remains zero (an elbow function). Result: There exists a degree
$d_0$ such that if $d_0 \le d_1 < d_2$, then the expected maximum processor workload under a scatter
decomposition of degree $d_2$ is no larger than the expected maximum processor workload under a
scatter decomposition of degree $d_1$.

3. Assumption: The correlation function decreases linearly across the entire domain. Result: For any
number of clusters $2^d$, among all mappings that assign $2^{d-p}$ clusters per processor the modular mapping
minimizes the average processor workload variance.

Performance ultimately is measured in terms of finishing time, so that the expected load of the most 
heavily loaded processor is an appropriate metric. One of our results addresses this metric directly. Average 
processor workload variance is a secondary measure, although intuition does suggest that decreasing the 
variance while keeping the mean constant will decrease the expected maximum. Consequently, all these 
results confirm our intuition that modularly mapping increasingly finer grained workload leads to better 
load balance. It should be noted that increased communication overhead is the price paid for this balance, 
and is a cost we do not include in this model. One should not interpret these results as saying that better 
overall performance can always be achieved by increasing the degree. For a given domain, there will be an 
optimal degree that balances the conflicting goals of low communication costs and good load balance. 

A brief analysis of scatter decomposition can be found in [15]. However, that analysis assumes statistical 
independence between all cluster workloads, and seems to consider the effects of scatter decomposition on 
a given architecture as the problem size is increased. As such it is an inappropriate model for studying the 
effects of changing the mapping of a single given problem. Treatments of other problems have used stochastic 
models of workload to estimate the expected finishing time; but invariably those models concern statistically 
independent workloads, e.g. the analyses in [3] and [6]. These results are inadequate for analyzing scatter 
decomposition. When all workload is independent, then aggregated workload is independent, and there is 
no performance benefit to be gained by scattering. Scatter decomposition is successful precisely because the 
workload is not independent. Our contribution is to propose and analyze a model that includes workload 
correlation, and explain why increasingly finer partitions mapped modularly tend to balance the load better. 
2 Analysis


In this section we study a probabilistic model of workload, and the performance of different mappings. For 
the sake of simplicity we constrain our model to be one-dimensional. This assumption does not negate 
the utility of the model; any multi-dimensional problem partitioned into hyper-strips can be viewed as a 
one-dimensional problem. Such partitions greatly simplify the programming needed to exchange information 
between processors. In fact, our experience in mapping a land-battle simulation using scatter decomposition 
was that strip partitions minimized the execution time [10]. This was also our experience in mapping a 
regular scientific code onto the Intel iPSC/1 [16]. 

Our analysis concerns the effect of scatter decomposition on load balance, in the absence of commu- 
nication or synchronization costs. By understanding how load balance in isolation is affected by the de- 
composition/mapping decisions, we are better able to understand the tension between load imbalance and 
communication/synchronization overheads. The model we use is intended to be descriptive, rather than 
predictive; the analysis is qualitative rather than quantitative. We doubt that the end benefits of fitting a 
model to performance data will justify the costs of doing so. Nevertheless we feel there is worth in formally 
affirming the intuition behind scatter decomposition. 

2.1 When and Why Scatter Decomposition Works 

Our model explains the success of scatter decomposition by showing that it induces correlation between 
processors’ workloads. To see the performance benefits of correlated workloads, imagine that a random 
workload is generated and partitioned so that the same amount of work is assigned to every processor. A 
processor’s workload is random, but all processors always finish at the same time, because their workloads 
are perfectly correlated. This situation is optimal, because all processors are busy all the time. Now 
imagine that the workload at every point is statistically independent of any other. No matter what the 
domain decomposition or mapping, processor workloads are statistically independent. In fact, the expected 
maximum processor workload is the same regardless of granularity, so long as the same volume of domain 
is assigned to each processor. The “ideal” of random but highly correlated processor workloads cannot be 
achieved in this artificial scenario. 

Scatter decomposition works because irregular workloads are not statistically independent: high workload 
tends to appear in contiguous regions. A sufficiently fine-grained decomposition will split the region up, and
modular assignment will spread its workload around. The contribution of that region to one processor’s
workload is highly correlated with the contribution of a nearby region to a different processor’s workload. If 
the underlying workload is highly correlated in nearby regions, then scatter decomposition induces correlation 
between processors’ workloads. We have observed this phenomenon in our own experiments with a one-dimensional
fluid flow computation using adaptive gridding [11]. The fluids problem exhibits irregular grids
similar to those in Figure 1.

For a given problem, the sample autocorrelation function [12](p. 437) is a statistical estimate of correlation 
between point workloads, as a function of the distance between them. Autocorrelations range between 1 
and -1; the larger the autocorrelation, the more similar the workloads of two points at a given distance tend 
to be. Zero correlation indicates no linear dependence (and, for jointly Gaussian workloads, statistical
independence); increasingly negative correlations imply increasing
dissimilarity between workloads. Figure 2 shows the sample autocorrelation function at one time-step in a
fluid flow computation. Not only does correlation diminish as a function of distance, it can reasonably be
modeled as a convex “elbow” function $\mathrm{Cov}(t) = \sigma^2 \max\{0, 1 - \alpha t\}$ over an appropriate range of $t$, and some
$\alpha > 0$. This corresponds nicely with two of our results, one of which assumes elbow correlation, the other of
which assumes a convex correlation function.
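
A minimal sketch of the estimator follows, assuming the common normalization in which each lag's sum of centered products is divided by the lag-0 sum (the text does not say which variant from [12] is used). The function name and the synthetic workload, a single contiguous hot region, are illustrative only.

```python
def sample_autocorrelation(workloads, max_lag):
    """Sample autocorrelation r[k] for lags k = 0..max_lag of a 1-D workload series."""
    n = len(workloads)
    mean = sum(workloads) / n
    centered = [w - mean for w in workloads]
    denom = sum(c * c for c in centered)
    if denom == 0:
        raise ValueError("a constant series has no defined autocorrelation")
    return [sum(centered[t] * centered[t + k] for t in range(n - k)) / denom
            for k in range(max_lag + 1)]


if __name__ == "__main__":
    # One contiguous region of high workload inside a quiet domain.
    workload = [1.0] * 40 + [10.0] * 20 + [1.0] * 40
    for lag, r in enumerate(sample_autocorrelation(workload, 15)):
        print(lag, round(r, 3))
```

For this synthetic series the estimate falls off roughly linearly with lag, which is the qualitative shape the elbow model describes.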

There are situations where scatter decomposition will not work well. Consider a one dimensional domain 
discretized into 1000 points, numbered between 0 and 999, to be mapped onto ten processors. Randomly 
choose some “base” number $b \in [0, 99]$, and imagine that every hundredth point beginning with $b$ has a
computational cost of 1000, while all other points have a computational cost of 1. If one evenly divides the
domain into ten subregions and maps them modularly, every processor has 1099 units of computation to
execute. Scatter a decomposition of twenty subregions, and half the processors each have a computational
cost of 2098, while the other half each have a cost of 100. Modularly assign each point individually, and
processor ($b$ mod 10) has a cost of 10090, while every other processor has a cost of 100. In this situation
mapping increasingly finer-grained workload leads to decreasing performance. Due to $b$’s randomness this
workload model is stochastic, and is second-order stationary. Two points at a distance $100m$ for $m = 1, \ldots, 9$
will always have the same workload. The correlation function at all distances $100m$ consequently has value
one. It has some fixed smaller value for all other distances. The principal reason this problem defeats
fine-grained scatter decomposition is the periodicity. One should be extremely careful using scatter decomposition
in the presence of strong periodic behavior, if there is any chance that the periodicity of the modular mapping 
can align with the periodicity of workload. The assumptions of the models we study do not admit periodicity. 
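
The three cases above can be reproduced by direct counting. The sketch below is illustrative; `processor_costs` and its parameters are hypothetical names, and b = 37 is an arbitrary choice of base.

```python
def processor_costs(num_pieces, b, num_points=1000, num_procs=10,
                    period=100, heavy_cost=1000):
    """Per-processor cost when the domain is cut into equal pieces that are mapped modularly."""
    if num_points % num_pieces:
        raise ValueError("pieces must divide the domain evenly")
    piece_size = num_points // num_pieces
    costs = [0] * num_procs
    for point in range(num_points):
        # Every hundredth point beginning with b is expensive; all others cost 1.
        cost = heavy_cost if point % period == b else 1
        costs[(point // piece_size) % num_procs] += cost
    return costs


if __name__ == "__main__":
    for pieces in (10, 20, 1000):
        print(pieces, processor_costs(pieces, b=37))
```

The maximum processor cost grows from 1099 to 2098 to 10090 as the granularity becomes finer, because the period of the mapping lines up with the period of the workload.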

2.2 Model Preliminaries 

We consider the behavior of a computation over a real line interval, divided into n clusters, and mapped onto 
P processors. Both n and P are taken to be powers of two, and n > P. We are interested in the average 
processor workload variance, and in the expected workload of the processor that takes the longest time to 
complete. Without loss of generality we take the real interval to be [0, 1]. Assume that every point $t \in [0, 1]$
has a certain work intensity $W(t)$. The time required to process $[a, b]$ is the integral of $W(t)$ from $t = a$ to
$t = b$. We assume that the intensities $W(t)$ are unknown, but we are willing to model our uncertainty by
assuming that $W(t)$ is a random variable, and that $W(t)$ can be viewed as a second-order stationary process
[13] over $t \in [0, 1]$. Thus we suppose that $E[W(t)] = \mu$ for all $t \in [0, 1]$, that $\mathrm{Var}[W(t)] = \sigma^2$ for all $t \in [0, 1]$,
and that $\mathrm{Cov}[W(t), W(s)]$ depends only on $|t - s|$. To emphasize this point we will denote the covariance
function as $\mathrm{Cov}(|t - s|)$. These assumptions are reasonable if we are unwilling or unable to differentiate
between the likely behavior of the computation at $t$ and at $s$. We do not assume that $W(t) = W(s)$; we
simply assume that we have the same degree of uncertainty about $W(t)$ and $W(s)$.

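The execution time for $[a, b]$ is

$$T(a, b) = \int_a^b W(t)\,dt.$$

$T(a, b)$ has mean value $(b - a)\mu$. Since $E[W(t)W(s)] = \mathrm{Cov}(|s - t|) + \mu^2$, the variance of $T(a, b)$ is

$$\begin{aligned}
\mathrm{Var}[T(a, b)] &= E[T(a, b)^2] - (b - a)^2\mu^2 \\
&= E\left[\left(\int_a^b W(t)\,dt\right)\left(\int_a^b W(s)\,ds\right)\right] - (b - a)^2\mu^2 \\
&= \int_a^b\int_a^b E[W(t)W(s)]\,dt\,ds - (b - a)^2\mu^2 \\
&= \int_a^b\int_a^b \mathrm{Cov}(|s - t|)\,dt\,ds.
\end{aligned} \tag{1}$$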


Following a decomposition into $n$ clusters, the $i$th cluster’s workload is $T(i/n, (i + 1)/n)$, and is denoted
as $c_i(n)$. The random vector of cluster workloads is denoted $C(n) = \langle c_0(n), \ldots, c_{n-1}(n) \rangle$.

We are interested in the covariance matrix $\sigma_C^2$ for the cluster workloads. For $i \ne j$ we have

$$\mathrm{Cov}[c_i(n), c_j(n)] = (\sigma_C^2)_{ij} = \int_{i/n}^{(i+1)/n}\int_{j/n}^{(j+1)/n} \mathrm{Cov}(|t - s|)\,dt\,ds. \tag{2}$$

$\mathrm{Var}[c_i(n)]$ is simply $\mathrm{Var}[T(i/n, (i + 1)/n)]$, given above. The sequence $c_0(n), c_1(n), \ldots, c_{n-1}(n)$ is second-order
stationary, a fact easily deduced from equations (1) and (2). To emphasize this we define the function
$\phi$:

$$\phi(|j - i|, n) = \mathrm{Cov}[c_i(n), c_j(n)].$$

Note that $\phi(0, n)$ is a cluster’s variance.

An assignment of clusters to processors is described by a $P \times n$ assignment matrix whose $ij$-th entry
is 1 if $c_j(n)$ is assigned to processor $i$, and is 0 otherwise. Given assignment matrix $A$, the multiplication
$AC$ yields a $P \times 1$ random vector whose $j$th component is the sum of the execution times of all clusters
assigned to processor $j$. The vector of mean processor loads is the matrix-vector product $A\mu_n$, where $\mu_n$ is
the $n$ element vector with $\mu/n$ in each coordinate. The covariance matrix of $AC$ is the product $A\sigma_C^2 A^T$,
where $A^T$ is the transpose of $A$. The overall execution time is the maximum processor execution time, or
$\max\{(AC)^T\}$. This quantity is random.
For any processor $P_i$, let $A(i)$ denote the set of clusters assigned to it under $A$, and let $L_i(A, n)$ be $P_i$’s
random workload. By definition the variance of $L_i(A, n)$ is given by

$$\mathrm{Var}[L_i(A, n)] = (A\sigma_C^2 A^T)_{ii} = \sum_{c_j(n) \in A(i)} \phi(0, n) + \sum_{\substack{\langle c_j(n), c_k(n) \rangle \in A(i) \times A(i) \\ j \ne k}} \phi(|j - k|, n). \tag{3}$$

The first component of this expression is the sum of variances of all clusters assigned to $P_i$. The second
component is a sum of cluster covariance terms (we will call these cc terms), that depends on the assignment.
Similarly, the covariance between processor workloads $L_i(A, n)$ and $L_j(A, n)$ is given by a sum of cc terms:

$$\mathrm{Cov}[L_i(A, n), L_j(A, n)] = \sum_{\langle c_k(n), c_m(n) \rangle \in A(i) \times A(j)} \phi(|k - m|, n). \tag{4}$$

The sum of all cluster covariance matrix terms always equals the sum of all processor workload variances
and covariances¹:

$$\sum_{i=0}^{n-1}\sum_{j=0}^{n-1} (\sigma_C^2)_{ij} = \sum_{i=0}^{P-1}\sum_{j=0}^{P-1} (A\sigma_C^2 A^T)_{ij}.$$

This implies a balance between processor workload variances and covariances (and hence correlations); if by
changing $A$ we reduce the average processor workload variance, then we are increasing the average
inter-processor workload correlation.
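
The conservation law is easy to check numerically. The sketch below is an illustration under stated assumptions: it borrows the linear covariance Cov(s) = σ²(1 − αs) that the paper uses in Section 2.5, with σ² = 1 and α = 1 picked arbitrarily, and it represents the assignment matrix implicitly by a list `owner` (a hypothetical name) giving each cluster's processor.

```python
from fractions import Fraction


def cluster_covariance(n, alpha=Fraction(1), sigma2=Fraction(1)):
    """n x n matrix of Cov[c_i(n), c_j(n)], assuming Cov(s) = sigma2 * (1 - alpha * s)."""
    def phi(m):
        shift = Fraction(1, 3) if m == 0 else m
        return sigma2 * (n - alpha * shift) / n ** 3
    return [[phi(abs(i - j)) for j in range(n)] for i in range(n)]


def processor_covariance(owner, num_procs, cov):
    """Entries of A * cov * A^T, where A[q][j] = 1 exactly when owner[j] == q."""
    out = [[Fraction(0)] * num_procs for _ in range(num_procs)]
    for j, row in enumerate(cov):
        for k, value in enumerate(row):
            out[owner[j]][owner[k]] += value
    return out


def total(matrix):
    """Sum of all entries of a matrix given as a list of rows."""
    return sum(sum(row) for row in matrix)


if __name__ == "__main__":
    cov = cluster_covariance(8)
    for name, owner in (("modular", [j % 2 for j in range(8)]),
                        ("blocked", [0] * 4 + [1] * 4)):
        proc = processor_covariance(owner, 2, cov)
        print(name, total(proc) == total(cov), proc[0][0] + proc[1][1])
```

Both assignments conserve the total, but the modular one places less of it on the diagonal (the variances) and more off the diagonal (the inter-processor covariances).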

The indices of the sums (3) and (4) have special structure when $A$ describes a modular mapping. We
know that if $c_j(n)$ and $c_k(n)$ are assigned to the same processor, then $|j - k|$ is a multiple of $P$. Under a
modular mapping each processor will have $n/P$ clusters. Among these there are $n/P - 1$ pairs of clusters
whose indices are exactly $P$ apart, $n/P - 2$ pairs whose indices are exactly $2P$ apart, and so on. Since $n$ and
$P$ determine the specifics of the mapping we may drop the notational dependence of $L_i(A, n)$ on $A$. Under
the modular mapping we may write the common processor workload variance as

$$\mathrm{Var}[L(n)] = (n/P)\phi(0, n) + 2\sum_{k=1}^{(n/P)-1} ((n/P) - k)\,\phi(kP, n). \tag{5}$$


To consider processor workload covariance under a modular assignment take $i < j$, and consider a
cluster $c_a(n)$ assigned to processor $P_i$. It has cc terms with all processor $P_j$ clusters $c_m(n)$ such that
$|a - m| \bmod P = j - i$ or $|a - m| \bmod P = P - j + i$. There are $((n/P) - k)$ cc terms arising from clusters
whose indices are $kP + j - i$ apart (for $k = 0, \ldots, (n/P) - 1$); there are $((n/P) - k)$ cc terms arising from
clusters whose indices are $kP - j + i$ apart (for $k = 1, \ldots, (n/P) - 1$). We may therefore write

$$\begin{aligned}
\mathrm{Cov}[L_i(n), L_j(n)] &= \sum_{k=0}^{(n/P)-1} ((n/P) - k)\,\phi(kP + j - i, n) + \sum_{k=1}^{(n/P)-1} ((n/P) - k)\,\phi(kP - j + i, n) \\
&= (n/P)\,\phi(j - i, n) + \sum_{k=1}^{(n/P)-1} ((n/P) - k)\,\phi(kP + j - i, n) + \sum_{k=1}^{(n/P)-1} ((n/P) - k)\,\phi(kP - j + i, n).
\end{aligned} \tag{6}$$

¹This conservation law proved to be invaluable when debugging detailed expressions for the processor workload variance and
covariances, e.g., (12) and (13).


2.3 Decreasing Workload Variance 


Under very general assumptions one can show that increasing the degree of a scatter decomposition reduces 
the common processor workload variance. The necessary assumptions are that the workload process be 
second-order stationary, and that its covariance function be convex. 

The first step is to show that $\phi(|j - i|, n)$ is a convex function of $|j - i|$ over the range $1, 2, \ldots, n - 1$.
Towards this end assume that $x \ge 1/n$ and define

$$\begin{aligned}
I(n, x) &= E\left[\int_0^{1/n}\int_x^{x+1/n} (W(s) - \mu)(W(t) - \mu)\,dt\,ds\right] \\
&= \int_0^{1/n}\int_x^{x+1/n} \mathrm{Cov}(t - s)\,dt\,ds \\
&= \int_0^{1/n}\left[\int_x^{\infty} \mathrm{Cov}(t - s)\,dt - \int_{x+1/n}^{\infty} \mathrm{Cov}(t - s)\,dt\right] ds.
\end{aligned}$$

Taking the derivative with respect to $x$ we find that

$$\frac{\partial}{\partial x} I(n, x) = \int_0^{1/n} \left(\mathrm{Cov}(x + 1/n - s) - \mathrm{Cov}(x - s)\right) ds.$$

The difference being integrated increases in $x$ due to the convexity of $\mathrm{Cov}(t)$, implying that the derivative of $I(n, x)$
with respect to $x$ increases in $x$, which is one characterization of a convex function. By stationarity
$\mathrm{Cov}[c_i(n), c_j(n)] = \mathrm{Cov}[c_0(n), c_{|j-i|}(n)]$; furthermore $\mathrm{Cov}[c_0(n), c_{|j-i|}(n)] = I(n, |j - i|/n)$. Consequently $\mathrm{Cov}[c_i(n), c_j(n)]$ is a
convex function of $|j - i|$ once $|j - i| \ge 1$ (it may indeed be convex over the entire range, but that fact has
not been shown, and is not needed).

We are interested in the effects of moving from a scatter decomposition with degree $d - 1$ to one with
degree $d$. To analyze these effects we make the following observation. Consider a domain partitioned into
$n = 2^d$ clusters, which is mapped by modularly assigning pairs of clusters: $c_0(n)$ and $c_1(n)$ are assigned to
processor 0, $c_2(n)$ and $c_3(n)$ are assigned to processor 1, and so on. This mapping is identical to the scatter
decomposition of degree $d - 1$; the pair of clusters $c_0(n), c_1(n)$ viewed from the degree $d$ mapping is the same
as the single cluster $c_0(n/2)$ viewed from the degree $d - 1$ mapping. We will show that the modular mapping
with degree $d - 1$ produces processor variances that are no smaller than those of the modular mapping with
degree $d$.

Split each cluster $c_j(n/2)$ into two equal sized clusters. The sum of the two split cluster variances plus
twice their covariance must equal the variance of $c_j(n/2)$. That is,

$$\phi(0, n/2) = 2\phi(0, n) + 2\phi(1, n). \tag{7}$$

Similarly, take two clusters $c_i(n/2)$ and $c_j(n/2)$, and split each into two equal sized clusters. The total
covariance between the four split clusters must equal the covariance between the two unsplit clusters. Thus

$$\phi(|j - i|, n/2) = 2\phi(2|j - i|, n) + \phi(2|j - i| + 1, n) + \phi(2|j - i| - 1, n). \tag{8}$$

Note that the index values must double when taken with respect to $n$ rather than $n/2$ clusters.

Substituting the right-hand-sides of equations (7) and (8) into equation (5) and working through the
algebra, we find that

$$\begin{aligned}
\mathrm{Var}[L(n/2)] &= (n/P)\phi(0, n) + (n/P)\phi(1, n) + 2\sum_{k=1}^{(n/(2P))-1} ((n/P) - 2k)\,\phi(2kP, n) \\
&\quad + \sum_{k=1}^{(n/(2P))-1} ((n/P) - 2k)\left[\phi(2kP + 1, n) + \phi(2kP - 1, n)\right].
\end{aligned}$$

Using this expression and (5), we compute the difference $\mathrm{Var}[L(n/2)] - \mathrm{Var}[L(n)]$. All terms involving
$\phi(2kP, n)$ cancel, for $k = 0, \ldots, n/(2P) - 1$. Each remaining term from $\mathrm{Var}[L(n)]$ has the form
$2((n/P) - 2k - 1)\,\phi((2k + 1)P, n)$, for $k = 0, \ldots, n/(2P) - 1$. We split each such term into the sum
$(n/P - 2k)\,\phi((2k + 1)P, n) + (n/P - 2k - 2)\,\phi((2k + 1)P, n)$, and pair these with $\mathrm{Var}[L(n/2)]$ terms as follows:

$$\begin{aligned}
\mathrm{Var}[L(n/2)] - \mathrm{Var}[L(n)] = \sum_{k=0}^{(n/(2P))-1} \Big( &(n/P - 2k)\left(\phi(2kP + 1, n) - \phi((2k + 1)P, n)\right) \\
&- (n/P - 2k - 2)\left(\phi((2k + 1)P, n) - \phi((2k + 2)P - 1, n)\right) \Big).
\end{aligned} \tag{9}$$

One characteristic of a convex function $g$ is that for fixed $y$ the difference $g(x) - g(x + y)$ is a decreasing
function of $x$. Every two terms we have paired differ in their index arguments by exactly $P - 1$, e.g.,
$\phi(2kP + 1, n)$ and $\phi((2k + 1)P, n)$. Since $\phi$ is a convex function of the index argument once the index is at
least 1, we have for every $k$

$$\phi(2kP + 1, n) - \phi((2k + 1)P, n) \ge \phi((2k + 1)P, n) - \phi((2k + 2)P - 1, n).$$

The left-hand-side expression in this inequality is weighted more heavily in equation (9) than is the
right-hand-side expression. It follows that $\mathrm{Var}[L(n/2)] - \mathrm{Var}[L(n)] \ge 0$, proving our first result.

Theorem 1 Suppose the workload process W(t) is second-order stationary with a convex covariance func- 
tion. Then increasing the degree of a scatter decomposition does not increase the processor workload variance. 

□ 
2.4 Decreasing Expected Maximum Workload


Next we demonstrate circumstances where increasing the degree of a scatter decomposition reduces the 
expected workload of the most heavily loaded processor. The argument is to show that under appropriate 
assumptions the correlation between any two processors’ workloads increases as the degree increases. We 
then cite a result from the literature proving that the expected maximum decreases in this situation. 

We assume that the workload process $\{W(t)\}$ is a stationary Gaussian process² [7]. Additivity properties
of the Gaussian then ensure that the vector of $n$ clusters has a jointly normal distribution [7](Chapter 6) and
that under any assignment, the processors’ workloads are jointly normal. We also assume that the correlation
function is $\mathrm{Cov}(t) = \sigma^2 \max\{0, 1 - \alpha t\}$, where $\alpha = 2^v/m_\alpha > 1$ for some integers $v, m_\alpha > 0$. The restriction
on $\alpha$ is used to simplify certain calculations. $\delta = 1/\alpha$ is the smallest distance $t$ at which $\mathrm{Cov}(t) = 0$. Our
results apply when the degree is large enough so that subinterval $[0, \delta]$ is partitioned into at least $P = 2^p$
clusters. If the degree is $d$, then the number of clusters in $[0, \delta]$ is $\delta 2^d$. Now let $d_0$ be the least $d$ such that
$\delta 2^d \bmod 2^p = 0$. Equivalently, $d_0$ is the least integer $d$ such that $m_\alpha 2^{d-p-v}$ is an integer. Clearly $d_0 \le p + v$.
Our results apply when the degree is at least $d_0$.

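We can compute functional forms for $\phi(|j - i|, n)$ given this explicit definition of $\mathrm{Cov}(t)$. Performing the
integration given by (2) one determines that

$$\phi(|j - i|, n) = \begin{cases}
\dfrac{\sigma^2}{n^3}\,(n - \alpha|j - i|) & \text{if } 0 < |j - i| < \delta n \\[1ex]
\dfrac{\sigma^2 \alpha}{6 n^3} & \text{if } |j - i| = \delta n \\[1ex]
0 & \text{if } |j - i| > \delta n.
\end{cases} \tag{10}$$

These calculations take advantage of the fact that $\delta$ is a multiple of $1/n$. The variance of a cluster is
determined by evaluating (1), yielding

$$\phi(0, n) = \frac{\sigma^2}{n^3}\,(n - \alpha/3). \tag{11}$$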

Given equations (10) and (11) we can compute processor workload variance and covariance under scatter
decomposition. General expressions for these quantities are given by (5) and (6). For large values of $k$, some
terms in those sums vanish, being zero. Our assumption that the scatter decomposition has degree $d_0$ or
larger ensures that terms which vanish are easily characterized,³ and that those clusters whose indices are
exactly $\delta n$ apart are assigned to the same processor. All $\phi(kP, n)$ terms in (5) vanish for $k > \delta n/P$; we have
$\phi(kP, n) = \sigma^2 \alpha/(6n^3)$ for $k = \delta n/P$. We may rewrite the variance as

$$\begin{aligned}
\mathrm{Var}[L(n)] &= (n/P)\,\frac{\sigma^2}{n^3}\,(n - \alpha/3) + 2\sum_{k=1}^{(\delta n/P)-1} ((n/P) - k)\,\frac{\sigma^2}{n^3}\,(n - \alpha kP) + 2\,(n/P - \delta n/P)\,\frac{\sigma^2 \alpha}{6 n^3} \\
&= \sigma^2\left(\frac{\delta - \delta^2/3}{P^2} + \frac{1}{3n^2} - \frac{1}{3n^2 P}\right).
\end{aligned} \tag{12}$$

²Note that this assumption is stronger than we have used so far, due both to stationarity rather than second-order stationarity,
and to the assumption of a specific workload distribution.

³This is not the case for smaller degrees. A large number of special cases must be constructed and analyzed. This task
seemed to us to be more tedious than is warranted by the anticipated correspondingly stronger result.
Calculation of this equality is much simplified with the use of a symbolic mathematics package. 
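
Short of a symbolic package, the equality can be spot-checked with exact rational arithmetic. The sketch below assumes the piecewise form of $\phi$ in (10) and (11) with $\sigma^2 = 1$, and reads the closed form of (12) as $(\delta - \delta^2/3)/P^2 + 1/(3n^2) - 1/(3n^2P)$. The function names and the parameter choice $\alpha = 4$, $P = 2$ (so that $d_0 = 3$) are illustrative.

```python
from fractions import Fraction


def phi(m, n, alpha):
    """Cluster covariance at index distance m, per (10) and (11) with sigma^2 = 1."""
    alpha = Fraction(alpha)
    delta_n = n / alpha                      # delta * n, assumed to be an integer
    if m == 0:
        return (n - alpha / 3) / n ** 3
    if m < delta_n:
        return (n - alpha * m) / n ** 3
    if m == delta_n:
        return alpha / (6 * n ** 3)
    return Fraction(0)


def variance_by_sum(n, P, alpha):
    """Equation (5): common processor variance under the modular mapping."""
    r = n // P
    return r * phi(0, n, alpha) + 2 * sum((r - k) * phi(k * P, n, alpha)
                                          for k in range(1, r))


def variance_closed_form(n, P, alpha):
    """Closed form of (12) as read above; assumed valid only for degree >= d_0."""
    delta = 1 / Fraction(alpha)
    return ((delta - delta ** 2 / 3) / P ** 2
            + Fraction(1, 3 * n ** 2) - Fraction(1, 3 * n ** 2 * P))


if __name__ == "__main__":
    for n in (8, 16, 32, 64):
        print(n, variance_by_sum(n, 2, 4), variance_closed_form(n, 2, 4))
```

The two columns agree for every degree at or above $d_0$ and decrease as $n$ doubles. At $n = 4$, below $d_0$, the sum and the closed form differ, which is consistent with the restriction on the degree.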

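The processor workload covariance is similarly handled. Assume that $i < j$. $k = \delta n/P$ again delineates
where $\phi$ terms vanish: $\phi(kP + j - i, n) = 0$ for all $k \ge \delta n/P$, and $\phi(kP - j + i, n) = 0$ for all $k > \delta n/P$. We
may rewrite (6) as

$$\begin{aligned}
\mathrm{Cov}[L_i(n), L_j(n)] &= (n/P)\,\frac{\sigma^2}{n^3}\,(n - \alpha(j - i)) + \sum_{k=1}^{(\delta n/P)-1} ((n/P) - k)\,\frac{\sigma^2}{n^3}\,(n - \alpha(kP + j - i)) \\
&\quad + \sum_{k=1}^{\delta n/P} ((n/P) - k)\,\frac{\sigma^2}{n^3}\,(n - \alpha(kP - j + i)) \\
&= \sigma^2\left[\frac{\delta - \delta^2/3}{P^2} + \frac{1}{3n^2} - \frac{j - i}{P n^2}\right].
\end{aligned} \tag{13}$$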


The correlation between $L_i(n)$ and $L_j(n)$ is the ratio $\mathrm{Cov}[L_i(n), L_j(n)]/\mathrm{Var}[L(n)]$. For all $d \ge d_0$ we
obtain the correlation using (13) and (12), and can treat the ratio as a continuous function of $n$. It is
interesting to note that as $n$ increases the correlation approaches unity. This supports our intuition that
partitioning the domain into increasingly finer clusters and mapping them modularly induces correlation
between processor workloads. In fact, the tendency towards unity is monotonic. Taking the derivative with
respect to $n$ we find that the derivative is positive if

$$(4/3 - 2\delta/3)(j - i) + 2\delta/9 - 2/3 > 0.$$

This inequality holds, since $(4/3 - 2\delta/3) \ge 2/3$. Consequently, for all $n = 2^d \ge 2^{d_0}$ we must have

$$\mathrm{Cov}[L_i(2n), L_j(2n)]/\mathrm{Var}[L(2n)] \ge \mathrm{Cov}[L_i(n), L_j(n)]/\mathrm{Var}[L(n)].$$
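
The monotone approach to unity can be observed directly, without relying on the closed forms. The sketch below sums the cluster covariances of (10) and (11), with $\sigma^2 = 1$, over all pairs of clusters held by two processors under the modular mapping. Names and parameter values ($\alpha = 4$, $P = 4$, hence $d_0 = 4$) are assumptions made for illustration.

```python
from fractions import Fraction


def phi(m, n, alpha):
    """Cluster covariance at index distance m, per (10) and (11) with sigma^2 = 1."""
    alpha = Fraction(alpha)
    delta_n = n / alpha                      # delta * n, assumed to be an integer
    if m == 0:
        return (n - alpha / 3) / n ** 3
    if m < delta_n:
        return (n - alpha * m) / n ** 3
    if m == delta_n:
        return alpha / (6 * n ** 3)
    return Fraction(0)


def processor_cov(i, j, n, P, alpha):
    """Cov[L_i(n), L_j(n)] summed over every pair of clusters (modular mapping)."""
    return sum(phi(abs(a - b), n, alpha)
               for a in range(i, n, P) for b in range(j, n, P))


def correlation(i, j, n, P, alpha):
    """Correlation of two processor workloads; the variance is common to all processors."""
    return processor_cov(i, j, n, P, alpha) / processor_cov(i, i, n, P, alpha)


if __name__ == "__main__":
    for n in (16, 32, 64):
        print(n, float(correlation(0, 1, n, 4, 4)), float(correlation(0, 3, n, 4, 4)))
```

Each doubling of $n$ moves both correlations closer to one, with the most distant processor pair lagging behind the adjacent pair.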


Next we use this relationship to analyze the expected maximum processor workload. 

The following result is based on the Normal Comparison Lemma [8](p.81) and is the key to our observa- 
tions concerning the expected maximum processor workload. 

Theorem 2 (Leadbetter et al.) Let $\xi_0, \ldots, \xi_k$ be standardized jointly normal random variables, and let
$\eta_0, \ldots, \eta_k$ be standardized jointly normal random variables, such that $\mathrm{Cov}(\xi_i, \xi_j) \le \mathrm{Cov}(\eta_i, \eta_j)$ for each $i, j$,
$i \ne j$. Then for every $u$,

$$\Pr\{\max\{\xi_0, \ldots, \xi_k\} \le u\} \le \Pr\{\max\{\eta_0, \ldots, \eta_k\} \le u\},$$

and hence

$$E[\max\{\xi_0, \ldots, \xi_k\}] \ge E[\max\{\eta_0, \ldots, \eta_k\}].$$

□

The standardization of a random variable $X$ is the scaled random variable $Z = (X - m)/s$, where $m$
and $s$ are $X$’s mean and standard deviation, respectively. The mean of a standardized random variable is
zero and its variance is one; the covariance between two standardized random variables is the correlation
between their corresponding unstandardized forms. Let $Z_i(n)$ be the standardized workload of processor $P_i$
given a domain of $n$ clusters. $\mathrm{Cov}[Z_i(n), Z_j(n)] = \mathrm{Cov}[L_i(n), L_j(n)]/\mathrm{Var}[L(n)]$, which we have shown to be
increasing in $n$. If $\tilde{n} > n$ (equivalently, if one scatter decomposition has higher degree than another), then

$$E[\max\{Z_0(n), \ldots, Z_{P-1}(n)\}] \ge E[\max\{Z_0(\tilde{n}), \ldots, Z_{P-1}(\tilde{n})\}]. \tag{14}$$

The expected maximum workload is

$$\begin{aligned}
E[\max\{L_0(n), \ldots, L_{P-1}(n)\}] &= E\left[\max_{0 \le i \le P-1}\{\mu/P + \mathrm{Var}[L(n)]^{1/2} Z_i(n)\}\right] \\
&= \mu/P + \mathrm{Var}[L(n)]^{1/2}\, E\left[\max_{0 \le i \le P-1}\{Z_i(n)\}\right].
\end{aligned}$$

Theorem 1 shows that $\mathrm{Var}[L(n)] \ge \mathrm{Var}[L(2n)]$; this along with inequality (14) proves our second result.

Theorem 3 Let $\{W(t)\}$ be a stationary Gaussian process, with a covariance function $\mathrm{Cov}(t) = \sigma^2 \max\{0, 1 - \alpha t\}$,
where $\alpha = 2^v/m_\alpha > 1$ for some positive integers $m_\alpha, v$. Let there be $2^p$ processors, and let $d_0$ be the
least integer $d$ such that $m_\alpha 2^{d-p-v}$ is an integer. If $d_2 > d_1 \ge d_0$, then the expected maximum processor
workload of a scatter decomposition with degree $d_2$ is no greater than that of a scatter decomposition with
degree $d_1$.

□ 
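
For two processors the expected maximum has a closed form, which makes Theorem 3 concrete. The sketch below is an illustration under two assumptions: that (12) and (13) take the closed forms written in the code comments, and the standard identity $E[\max\{Z_0, Z_1\}] = \sqrt{(1 - \rho)/\pi}$ for standardized bivariate normal variables with correlation $\rho$. The function name and parameter defaults are arbitrary.

```python
import math


def expected_max_two_procs(n, mu=1.0, sigma=1.0, alpha=2.0):
    """E[max{L_0(n), L_1(n)}] for P = 2 under the elbow covariance (degree >= d_0 assumed)."""
    P = 2
    delta = 1.0 / alpha
    base = (delta - delta ** 2 / 3) / P ** 2
    # Assumed closed form of (12): common processor workload variance.
    var = sigma ** 2 * (base + 1 / (3 * n ** 2) - 1 / (3 * n ** 2 * P))
    # Assumed closed form of (13) with j - i = 1: covariance of the two workloads.
    cov = sigma ** 2 * (base + 1 / (3 * n ** 2) - 1 / (P * n ** 2))
    rho = cov / var
    # mu/P + Var^(1/2) * E[max of two standardized normals with correlation rho].
    return mu / P + math.sqrt(var) * math.sqrt((1 - rho) / math.pi)


if __name__ == "__main__":
    for n in (4, 8, 16, 32):
        print(n, expected_max_two_procs(n))
```

Under these assumptions the excess over the mean load $\mu/2$ simplifies to $\sigma/(n\sqrt{3\pi})$, so it halves each time the degree increases by one.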


2.5 Minimization of Average Workload Variance 

Our final result gives conditions where for a given $n$, among all “balanced” assignments (those placing $n/P$
clusters per processor), the modular mapping minimizes the average processor workload variance. To prove
this result we assume that the covariance function decreases linearly across the entire domain:
$\mathrm{Cov}(s) = \sigma^2(1 - \alpha s)$, for some $\alpha$ satisfying $0 < \alpha < 2$. The result is based on a procedure that takes any assignment
and constructs another whose sum of processor workload variances is no larger. The repeated application of 
this procedure produces a modular assignment. Consequently, modular assignments minimize the average 
processor workload variance. 

The arguments to follow specify individual covariance terms. These arguments are clearer using the
$\mathrm{Cov}[c_i(n), c_j(n)]$ notation rather than $\phi(|j - i|, n)$. It is straightforward to determine the form of $\mathrm{Cov}[c_i(n), c_j(n)]$
under the present assumptions:

$$\mathrm{Cov}[c_i(n), c_j(n)] = \begin{cases}
\dfrac{\sigma^2}{n^3}\,(n - \alpha|j - i|) & \text{if } |j - i| > 0 \\[1ex]
\dfrac{\sigma^2}{n^3}\,(n - \alpha/3) & \text{if } |j - i| = 0.
\end{cases} \tag{15}$$


Let $A_1$ be any assignment matrix describing a balanced assignment. Without loss of generality, we assume that under $A_1$ the processors are numbered so that $P_0$ is assigned $c_0(n)$, $P_1$ is assigned the smallest indexed $c_i(n)$ that is not assigned to $P_0$, and in general $P_j$ is assigned the smallest indexed cluster that is not assigned to any of $P_0, P_1, \ldots, P_{j-1}$.

We will say that $c_j(n)$ is in place if it is assigned to processor $P_{j \bmod P}$. Note that all clusters are in place under a modular assignment. We construct another balanced assignment $A_2$ by finding the smallest indexed $c_i(n)$ that is not in place, and by putting it in place. Let $c_f$ denote this cluster, let $P_S$ denote the source processor that has $c_f$ under $A_1$, and let $P_T$ denote the target processor $P_{f \bmod P}$. Let $c_g$ be the smallest indexed cluster assigned to $P_T$ such that $g > f$. $A_2$ is constructed from $A_1$ by giving $c_f$ to $P_T$, and $c_g$ to $P_S$. Figure 3 illustrates these definitions. We will prove that the sum of processor variances under $A_2$ bounds that sum under $A_1$ from below; consequently the average workload variance under $A_2$ is no greater than that under $A_1$.
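The swap procedure just described can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation: an assignment is represented as a list `owner` with `owner[j]` the processor holding cluster $j$, processor labels are taken as given (the renumbering convention above is not applied), cluster covariances follow (15), and all function names are hypothetical.

```python
def cluster_cov(i, j, n, alpha, sigma2=1.0):
    """Cluster covariance from equation (15)."""
    if i == j:
        return sigma2 * (n - alpha / 3.0) / n**3
    return sigma2 * (n - alpha * abs(j - i)) / n**3


def variance_sum(owner, alpha):
    """Sum over processors of workload variance: all covariances between
    clusters (including each cluster with itself) held by the same processor."""
    n = len(owner)
    return sum(cluster_cov(i, j, n, alpha)
               for i in range(n) for j in range(n) if owner[i] == owner[j])


def swap_step(owner, num_procs):
    """Put the smallest-indexed out-of-place cluster c_f in place by swapping it
    with c_g, the smallest-indexed cluster beyond f held by the target processor.
    Assumes a balanced assignment. Returns False once the assignment is modular."""
    n = len(owner)
    for f in range(n):
        target = f % num_procs
        if owner[f] != target:
            source = owner[f]
            g = next(k for k in range(f + 1, n) if owner[k] == target)
            owner[f], owner[g] = target, source
            return True
    return False


def to_modular(owner, num_procs, alpha):
    """Apply swap steps until modular; return the variance sum after each step."""
    history = [variance_sum(owner, alpha)]
    while swap_step(owner, num_procs):
        history.append(variance_sum(owner, alpha))
    return history
```

Each step fixes one more low-indexed cluster, so at most $n$ steps are needed; the returned history lets one observe that the variance sum never increases along the way.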

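Recall that under any assignment matrix $A$ the variance of $P_i$'s workload is given by

\[
\mathrm{Var}[L_i(A,n)] = (A \Sigma_c A^T)_{ii}
= \sum_{c_j(n) \in \mathcal{A}(i)} \mathrm{Var}[c_j(n)]
+ \sum_{\substack{\langle c_j(n), c_k(n) \rangle \in \mathcal{A}(i) \times \mathcal{A}(i) \\ j \ne k}} \mathrm{Cov}[c_j(n), c_k(n)],
\tag{16}
\]

and that

\[
\mathrm{Cov}[L_i(A,n), L_j(A,n)] = \sum_{c_k(n) \in \mathcal{A}(i)} \; \sum_{c_m(n) \in \mathcal{A}(j)} \mathrm{Cov}[c_k(n), c_m(n)].
\]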

It is clear from (16) that the variance of any processor other than $P_S$ or $P_T$ is unaffected by swapping $c_f$ and $c_g$. To prove the desired result we need only show that the swap does not increase the sum of $P_T$'s and $P_S$'s variances. The change in processor variances caused by the swap is entirely due to changes in the sum of cluster covariance (cc) terms in each processor. After swapping $c_f$ and $c_g$, each cluster $c_i(n)$ assigned to $P_S$ loses the cc term $\mathrm{Cov}[c_f(n), c_i(n)]$ and gains the term $\mathrm{Cov}[c_g(n), c_i(n)]$. We let $\Delta_{LS}$ denote the sum of all such changes among clusters in $P_S$ to the left of $c_f$, and let $L_S$ denote the number of such clusters. Similarly $\Delta_{RS}$ denotes the sum of changes among clusters in $P_S$ to the right of $c_g$ and $R_S$ denotes the number of such clusters; $\Delta_{MS}$ denotes the sum of changes among clusters in $P_S$ with indices between $f$ and $g$. Expressions for these quantities are derived using equation (15):

\[
\Delta_{LS} = \sum_{\substack{c_i(n) \in \mathcal{A}_2(S) \\ i < f}} \big( \mathrm{Cov}[c_g(n), c_i(n)] - \mathrm{Cov}[c_f(n), c_i(n)] \big) = -\frac{\sigma^2}{n^3} (g - f) L_S \alpha;
\]

\[
\Delta_{RS} = \sum_{\substack{c_j(n) \in \mathcal{A}_2(S) \\ j > g}} \big( \mathrm{Cov}[c_g(n), c_j(n)] - \mathrm{Cov}[c_f(n), c_j(n)] \big) = \frac{\sigma^2}{n^3} (g - f) R_S \alpha;
\]

\[
\Delta_{MS} = \sum_{\substack{c_k(n) \in \mathcal{A}_2(S) \\ f < k < g}} \big( \mathrm{Cov}[c_g(n), c_k(n)] - \mathrm{Cov}[c_f(n), c_k(n)] \big) = \frac{\sigma^2}{n^3} \sum_{\substack{c_k(n) \in \mathcal{A}_2(S) \\ f < k < g}} (2k - f - g) \alpha.
\]

The change in $P_S$'s variance after the swap is the sum $\Delta_{LS} + \Delta_{MS} + \Delta_{RS}$.

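We can similarly describe the change in $P_T$'s variance with the definitions

\[
\Delta_{LT} = \sum_{\substack{c_i(n) \in \mathcal{A}_2(T) \\ i < f}} \big( \mathrm{Cov}[c_f(n), c_i(n)] - \mathrm{Cov}[c_g(n), c_i(n)] \big) = \frac{\sigma^2}{n^3} (g - f) L_T \alpha;
\]

\[
\Delta_{RT} = \sum_{\substack{c_j(n) \in \mathcal{A}_2(T) \\ j > g}} \big( \mathrm{Cov}[c_f(n), c_j(n)] - \mathrm{Cov}[c_g(n), c_j(n)] \big) = -\frac{\sigma^2}{n^3} (g - f) R_T \alpha.
\]

No term analogous to $\Delta_{MS}$ is necessary since there are no clusters in $P_T$ with indices between $f$ and $g$.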

The change in the sum of $P_S$'s variance with $P_T$'s variance is given by the sum of all the $\Delta$ terms defined above. We will show that the sum of $\Delta$ terms is bounded from above by 0. At this point a number of observations are useful. Since all $c_i(n)$ with $i < f$ are in place, it follows that $L_T \le L_S$. Thus $\Delta_{LS} + \Delta_{LT} \le 0$. It remains to show that $\Delta_{RS} + \Delta_{RT} + \Delta_{MS} \le 0$. We know that

\[
\Delta_{RS} + \Delta_{RT} = -\frac{\sigma^2}{n^3} (R_T - R_S)(g - f) \alpha; \tag{17}
\]

furthermore, since $n/P = L_T + R_T + 1$, we must also have $R_S \le R_T$. We proceed to show that the magnitude of $\Delta_{MS}$ is no greater than the magnitude of (17) and consequently prove the larger result.

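The quantity $m = n/P - L_S - R_S - 1$ is the number of clusters in $P_S$ whose indices lie strictly between $f$ and $g$. $\Delta_{MS}$ is maximized when the indices of these clusters are as large as possible, that is, when $k = g - 1, g - 2, \ldots, g - m$. With such indices, the sum of $c_g$'s cc terms in $P_S$ is

\[
\frac{\sigma^2}{n^3} \sum_{i=1}^{m} (n - i \cdot \alpha).
\]

Likewise, the sum of $c_f$'s cc terms in $P_S$ is

\[
\frac{\sigma^2}{n^3} \sum_{i=1}^{m} \big( n - (g - f - i) \alpha \big).
\]

From this, we see that $\Delta_{MS}$ when maximized satisfies

\[
\Delta_{MS} = \frac{\sigma^2}{n^3} \sum_{i=1}^{m} (n - i \cdot \alpha) - \frac{\sigma^2}{n^3} \sum_{i=1}^{m} \big( n - (g - f - i) \alpha \big)
= \frac{\sigma^2}{n^3}\, m (g - f - m - 1) \alpha
\le \frac{\sigma^2}{n^3}\, m (g - f) \alpha.
\]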

But note that

\[
\begin{aligned}
m &= n/P - L_S - R_S - 1 \\
  &\le n/P - L_T - R_S - 1 \\
  &= (n/P - L_T - R_T - 1) + (R_T - R_S) \\
  &= R_T - R_S,
\end{aligned}
\]

so that

\[
\Delta_{RS} + \Delta_{RT} + \Delta_{MS} \le \frac{\sigma^2}{n^3} \big( -(R_T - R_S)(g - f) \alpha + m \cdot (g - f) \alpha \big) \le 0.
\]

Consequently, swapping $c_f$ and $c_g$ does not increase the sum of $P_S$'s and $P_T$'s variances. Furthermore, the swap does not affect the sum of other processors' variances. Repeatedly applying this procedure puts every cluster in place, which is the modular assignment. This discussion has proved the following theorem.

Theorem 4 Let $\{W(t)\}$ be a second-order stationary process, with a covariance function $\mathrm{Cov}(s) = \sigma^2(1 - \alpha s)$, where $0 \le \alpha \le 2$. Let $P$ and $n$ be given such that $P$ divides $n$ evenly, and let $A_m$ be the $P \times n$ assignment matrix describing the modular mapping. Then for any $P \times n$ assignment matrix $A$ describing a balanced assignment,

\[
(1/P) \sum_{j=0}^{P-1} (A_m \Sigma_c A_m^T)_{jj} \le (1/P) \sum_{i=0}^{P-1} (A \Sigma_c A^T)_{ii}.
\]

□ 
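For small instances the statement of Theorem 4 can be checked by exhaustive enumeration. The sketch below is an illustration only, with hypothetical function names: it builds the 0/1 assignment matrix for every balanced assignment of $n$ clusters to $P$ processors, evaluates the diagonal of $A \Sigma_c A^T$ with cluster covariances taken from (15), and compares the modular mapping against the best balanced assignment found.

```python
from itertools import permutations


def cluster_cov(i, j, n, alpha, sigma2=1.0):
    """Cluster covariance from equation (15)."""
    if i == j:
        return sigma2 * (n - alpha / 3.0) / n**3
    return sigma2 * (n - alpha * abs(j - i)) / n**3


def average_variance(owner, num_procs, alpha):
    """(1/P) * sum_i (A Sigma_c A^T)_ii for the assignment owner[j] = processor of cluster j."""
    n = len(owner)
    # Row i of A has a 1 in column j exactly when cluster j is assigned to processor i.
    A = [[1 if owner[j] == i else 0 for j in range(n)] for i in range(num_procs)]
    sigma = [[cluster_cov(j, k, n, alpha) for k in range(n)] for j in range(n)]
    diag = [sum(A[i][j] * sigma[j][k] * A[i][k] for j in range(n) for k in range(n))
            for i in range(num_procs)]
    return sum(diag) / num_procs


def check_modular_is_optimal(n, num_procs, alpha):
    """Return (modular average variance, minimum over all balanced assignments)."""
    modular = [j % num_procs for j in range(n)]
    # Distinct permutations of the modular owner list are exactly the balanced assignments.
    best = min(average_variance(list(p), num_procs, alpha)
               for p in set(permutations(modular)))
    return average_variance(modular, num_procs, alpha), best
```

The enumeration grows factorially, so this is only practical for very small $n$ (for example $n = 6$, $P = 3$ gives 90 distinct balanced assignments); it is a sanity check, not a substitute for the proof.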

In the event that the workload process is Gaussian and stationary, we can show that increasing the degree 
reduces the expected maximum processor workload. We determine the processor variance and covariance 
under scatter decomposition by substituting the values given by (15) into (5) and (6). Assume that i < j. 
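Working through the algebra one determines that

\[
\mathrm{Var}[L(n)] = \sigma^2 \left[ \frac{1 - \alpha/3}{P^2} + \frac{\alpha}{3 n^2} \left( 1 - \frac{1}{P} \right) \right]
\]

and that

\[
\mathrm{Cov}[L_i(n), L_j(n)] = \sigma^2 \left[ \frac{1 - \alpha/3}{P^2} + \frac{\alpha}{n^2} \left( \frac{1}{3} - \frac{j - i}{P} \right) \right].
\]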

The derivative with respect to $n$ of $\mathrm{Cov}[L_i(n), L_j(n)] / \mathrm{Var}[L(n)]$ is positive if

\[ (2 - 2\alpha/3)(j - i) + 2\alpha/9 - 2/3 > 0. \]

This is always true over the range $\alpha \in [0, 2]$, since $j - i \ge 1$. Consequently the same arguments used to prove Theorem 3
can be applied here. 
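The behavior of the inter-processor correlation under refinement can also be examined directly from (15), without the closed-form algebra. The sketch below is illustrative and uses hypothetical function names; it assumes the scatter decomposition assigns clusters $i, i + P, i + 2P, \ldots$ to processor $P_i$ and sums cluster covariances accordingly.

```python
def cluster_cov(i, j, n, alpha, sigma2=1.0):
    """Cluster covariance from equation (15)."""
    if i == j:
        return sigma2 * (n - alpha / 3.0) / n**3
    return sigma2 * (n - alpha * abs(j - i)) / n**3


def scatter_cov(i, j, n, num_procs, alpha):
    """Cov[L_i(n), L_j(n)] under the modular mapping; i == j gives Var[L(n)]."""
    clusters_i = range(i, n, num_procs)   # clusters held by processor i
    clusters_j = range(j, n, num_procs)   # clusters held by processor j
    return sum(cluster_cov(a, b, n, alpha) for a in clusters_i for b in clusters_j)


def scatter_correlation(i, j, n, num_procs, alpha):
    """Cov[L_i(n), L_j(n)] / Var[L(n)] under the modular mapping."""
    return scatter_cov(i, j, n, num_procs, alpha) / scatter_cov(i, i, n, num_procs, alpha)
```

Evaluating `scatter_correlation` for a fixed pair of processors at $n = 8, 16, 32, \ldots$ shows the ratio rising toward 1 while `scatter_cov(i, i, ...)` falls, which is the pattern the argument relies on.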

3 Summary 

Scatter decomposition is an attractive method for mapping domain-oriented computations with irregular 
workloads to parallel architectures. Scatter decomposition partitions the domain into n equal-size pieces, 
and maps them modularly onto P processors. This paper uses a formal probabilistic model of correlated 
workload in a one-dimensional domain to explain why and when scatter decomposition works. First, we 
show that periodicity in workload correlation can lead to load imbalance under scatter decomposition if the 
correlation period aligns with the period of the modular mapping. Consequently we consider nonperiodic
workload correlation functions. 

Our first result shows that if workload correlation is a convex function of distance, then scattering with 
increasingly finer grained clusters decreases a processor’s workload variance, thereby increasing the average 
inter-processor workload correlation. Since the processor workload mean is unaffected by this change, one 
anticipates that the expected maximum workload will correspondingly decrease. 

Our second result affirms this intuition under a stronger set of assumptions: the workload process is 
Gaussian, and the correlation function decreases linearly in distance until it reaches zero and then stays at 
zero. We then show that once a scatter decomposition is sufficiently fine-grained, making the grain size finer
reduces the expected maximum processor workload.

Our third result shows that under slightly different assumptions still, among all possible balanced 
mappings scatter decomposition minimizes the average processor workload variance. This result depends on 
the correlation function decreasing linearly across the entire domain. In this case it is also true that if the 
workload process is Gaussian, then scattering a finer-grained decomposition reduces the expected maximum 
processor workload. 

These analytic results serve to formally verify the intuition behind scatter decomposition. However, 
the results only concern load balance. The additional communication cost of decreasing granularity is 
not built into this model. Extensions to this work might find the optimal granularity by determining 
a quantitative estimator of the expected maximum workload and the expected communication cost as a 
function of granularity. An overall execution time model would be constructed depending on the influence 
of architecture on the communication costs, and then optimized. 
Figure 2: Correlation as function of distance in 1D fluids problem